Introduction

This report reviews the statistical modeling and analysis of data on restaurants around the world. The purpose of this study is to understand what makes a successful restaurant successful. The primary interest is to identify variables that are significantly associated with restaurant ratings, and to explain why, in terms of factors such as a restaurant's location and status. Additionally, this report is designed to visualize the trends behind earning and retaining high ratings in this industry.

The remainder of this report is organized as follows. The Tidying Data section gives a detailed walkthrough of how the dataset was prepared for modeling and for testing the significance each variable plays in restaurant ratings. The Exploratory Analysis section describes an initial look at plots that characterize some of the most significant variables. (FIRST AND SECOND ANALYSIS)

Tidying Data

Loading Files From Kaggle Source - Loading all libraries and files required for this analysis.

#------LOAD DATA------------
library(tidyverse)
library(readxl)
library(leaflet)
library(flexdashboard)
library(gridExtra)
library(broom)
library(plotly)

#------LOAD FILES---------

CCodesfile="~/project/Country-Code.xlsx"
Zomatofile="~/project/zomato.xlsx"
CurrencyFile="~/project/CurrencyRates.xlsx"

CCodes=read_excel(CCodesfile) #importing country code file
zomato=read_excel(Zomatofile) #importing main zomato file
Currency=read_excel(CurrencyFile) #importing currency file

The original data set contained varying currencies per country, making it difficult to compare restaurants uniformly. We created a separate file containing the United States Dollar (USD) conversion rate for each country, which we use to convert local prices into a common standard.

Data Wrangling - Although the data comes pre-cleaned into appropriate columns, some manipulation still needs to be done.

#------ GET RID OF SOME COLUMNS & ADD NEW ONES -------------
zomato=zomato %>% select (-c(Rating_Color,Switch_To_Order_Menu,Address,Locality_Verbose,Currency))
#we remove currency column as it is incorrect and contains too many unique symbols

#add in new columns: Country Names according to Country Code & proper currency names with conversion rates
zomato = zomato %>% left_join(CCodes) 
zomato = zomato %>% left_join(Currency) 

#add in new column: USD Cost (for equal comparison)
zomato = zomato %>% mutate(Avg_Cost_USD = Average_Cost_For_Two*`Conversion Rate (USD)`) 
#multiply the cost column by its USD conversion rate

#replacing Cuisines col. with Principal_Cuisines (the first category of cuisines for every country row)
zomato= zomato %>% separate(Cuisines,into=c("Principal_Cuisines"),sep=',') #store "primary" cuisine types into new column and replace old with this

#Make sure rating categories are ordered:
zomato = zomato %>% mutate(Rating_Text=ordered(Rating_Text,levels=c("Excellent","Very Good","Good","Average","Poor")))

We begin by removing unnecessary columns and adding in necessary ones:

  1. Add in countries by country code via the country code file provided by Kaggle

  2. Add in our own currency conversion file, allowing us to convert Average Cost for Two into USD and store it in its own column called “Avg Cost USD”.

We notice there were many cuisines listed per restaurant, so to simplify things for analytic purposes, we extract the first cuisine listed by each restaurant.
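For illustration, the same first-cuisine extraction can be sketched in base R (a toy example with invented cuisine strings; the report itself uses tidyr's separate() as shown above):

```r
# Keep only the text before the first comma, mirroring the separate() step.
cuisines  = c("North Indian, Chinese, Mughlai", "Cafe")
principal = sub(",.*", "", cuisines)
principal  # "North Indian" "Cafe"
```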

Removing Missing Values - Since we will be focusing on the ratings of restaurants, we remove restaurants with missing aggregate rating values.

#---------REMOVE MISSING VALUES----------
#several restaurants, even expensive ones, show a rating of "0", so we treat unrated stores as missing
zomato[zomato == 0] = NA #replace zero values with NA
zomato[zomato == "Not rated"] = NA #replace "Not rated" values with NA
zomato = zomato %>% filter(!is.na(Aggregate_Rating))

india.data = zomato %>% filter(Country == "India")
other.data = zomato %>%  filter(Country!= "India")

Exploratory Analysis

Distribution Of Variables


INSERT COMMENT HERE

How Variables React to Aggregate Rating


INSERT COMMENT HERE

India Compared to the World - How does India compare to the other countries? Distributions


INSERT COMMENT HERE

India VS the World - Let's see how the variables' relationships with Aggregate Rating in India differ from those in the other countries


INSERT COMMENT HERE

Exploring India Data Only - Where would the best place in India be?


INSERT COMMENT HERE

First Stage Analysis

Linear Regression - Model with all the possible predictor variables in our dataset.

  Aggregate Rating
Predictors Estimates CI p
(Intercept) 3.24 1.53 – 4.95 <0.001
Observations 6504
R2 / adjusted R2 0.597 / 0.537

Since most of our data comes from New Delhi and the most common cuisine is North Indian, we set the baselines for City and Principal Cuisines accordingly. The first linear regression model includes all of the following variables:

  • Aggregate Rating

  • City

  • Locality

  • Principal Cuisine

  • Table Booking Option

  • Online Delivery Option

  • Price Range

  • Average Cost for Two (USD)

This model yields an R-squared value of 59.73% and is statistically significant. Included on the left is a table containing the coefficient estimates for this model. When ordering p-values from lowest to highest, notice that the majority of the variable estimates return p-values higher than 0.05.
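The fitting code itself is not shown in the report; the sketch below illustrates the approach on synthetic data (the data frame, its columns, and the coefficients are invented for illustration, and only a subset of the report's predictors is included):

```r
# Hypothetical miniature of the first-stage regression: categorical predictors
# with a chosen baseline level plus a continuous cost predictor.
set.seed(42)
d = data.frame(
  City              = sample(c("New Delhi", "Jaipur", "Mysore"), 300, replace = TRUE),
  Has_Table_Booking = sample(c("Yes", "No"), 300, replace = TRUE),
  Avg_Cost_USD      = round(rexp(300, rate = 0.1), 2)
)
d$Aggregate_Rating = 3 + 0.25 * (d$City == "Jaipur") +
  0.10 * (d$Has_Table_Booking == "Yes") + 0.02 * d$Avg_Cost_USD + rnorm(300, sd = 0.3)
d$City = relevel(factor(d$City), ref = "New Delhi")   # baseline = most common city
fit = lm(Aggregate_Rating ~ City + Has_Table_Booking + Avg_Cost_USD, data = d)
coef(summary(fit))   # estimates, std. errors, t statistics, p-values
```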

Diagnostic Graphs (1) - Checking the model's residuals-vs-fitted plot


Our residuals seem rather random, with the bulk of them around zero. The fitted line illustrates the overall trend of the residuals, which in our case falls slightly below zero.

Diagnostic Graphs (2) - Checking normality of residuals via QQ Plot


The QQ plot shows that the center region of the residuals is approximately normal, but the lower tail deviates heavily.

Diagnostic Graphs (3) - Checking for homoscedasticity via Scale Location


We ideally want to observe a roughly straight horizontal line in this plot. Unfortunately, this scale-location plot suggests heavy heteroscedasticity.

Drop Test - Checking if any variables can be dropped from the current model

Single term deletions

Model:
Aggregate_Rating ~ City + Locality + Principal_Cuisines + Has_Table_Booking + 
    Has_Online_Delivery + Price_Range + Avg_Cost_USD
                     Df Sum of Sq    RSS   AIC  F value    Pr(>F)    
<none>                            126430 20993                       
City                 19      1092 127522 21011   2.5716 0.0002015 ***
Locality            728     66826 193256 22297   4.1072 < 2.2e-16 ***
Principal_Cuisines   71     15068 141498 21583   9.4956 < 2.2e-16 ***
Has_Table_Booking     1      1298 127728 21058  58.0806 2.932e-14 ***
Has_Online_Delivery   1       150 126580 20999   6.7022 0.0096541 ** 
Price_Range           1         8 126438 20992   0.3380 0.5610341    
Avg_Cost_USD          1      3174 129604 21152 142.0286 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conducting a drop test using the F statistic shows we can drop the variable Price Range from our model without significantly affecting the fit (p ≈ 0.56).
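The drop test can be reproduced with base R's drop1(); below is a self-contained illustration on the built-in mtcars data rather than the report's dataset:

```r
# drop1() refits the model with each term deleted in turn and reports an
# F test per single-term deletion, the procedure used above.
fit = lm(mpg ~ wt + hp + qsec, data = mtcars)
drop1(fit, test = "F")
```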

Second Stage Analysis

Transforming Response Variable - We apply our first transformation to the dependent variable and observe its effects.

india.data = india.data %>% mutate(Transformed_Rating = log10(Aggregate_Rating/(5-Aggregate_Rating)))
  Transformed Rating
Predictors Estimates CI p
(Intercept) 0.27 -0.68 – 1.22 0.576
Observations 6504
R2 / adjusted R2 0.662 / 0.611

We add a new transformed variable to the dataset called “Transformed Rating”. Since Aggregate Rating is bounded between 0 and 5, we apply a transformation that keeps back-transformed estimates inside this range: \[log_{10}{\frac{Aggregate Rating}{5-Aggregate Rating}}\]
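A quick numerical check of this transformation and its inverse (a small self-contained sketch in base R):

```r
# Map ratings in (0, 5) onto the real line and back; estimates made on the
# transformed scale therefore always convert to valid ratings.
to_transformed = function(r) log10(r / (5 - r))
to_rating      = function(y) 5 * 10^y / (1 + 10^y)

r = c(0.5, 2.5, 4.9)
y = to_transformed(r)   # unbounded values
to_rating(y)            # recovers 0.5, 2.5, 4.9
```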

The transformed linear regression model returns an improved R-squared of 66.2%, roughly 6.5 percentage points higher than the original model's 59.7%.

Drop Test - Checking if any variables can be dropped from the transformed model

Single term deletions

Model:
Transformed_Rating ~ City + Locality + Principal_Cuisines + Has_Table_Booking + 
    Has_Online_Delivery + Avg_Cost_USD
                     Df Sum of Sq   RSS   AIC  F value  Pr(>F)    
<none>                            38947 13333                     
City                 19     257.8 39205 13338   1.9712 0.00710 ** 
Locality            728   27035.9 65983 15306   5.3950 < 2e-16 ***
Principal_Cuisines   71    5385.2 44333 14033  11.0187 < 2e-16 ***
Has_Table_Booking     1    1062.3 40010 13506 154.3239 < 2e-16 ***
Has_Online_Delivery   1      24.1 38971 13335   3.4981 0.06149 .  
Avg_Cost_USD          1    2435.6 41383 13725 353.8218 < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conducting a drop test using the F statistic shows we can now drop the variable Has Online Delivery from our transformed model without significantly affecting the fit (p ≈ 0.06).

Diagnostic Graphs (1) - Checking the transformed model's residuals-vs-fitted plot

Diagnostic Graphs (2) - Checking normality of residuals via QQ Plot


Our QQ plot illustrates that most of the residuals follow a relatively normal distribution. However, as with our original model, the upper and lower tails deviate from normality.

It's important to note that most of our variables are heavily skewed, producing these large, influential outliers. Since we are working with a large sample size, the deviation shown by this graph doesn't raise many red flags.

Diagnostic Graphs (3) - Checking for homoscedasticity


The scale-location plot shows that clear heteroscedasticity remains. Ideally, we want the plot to show a relatively straight horizontal line, indicating homoscedasticity. Further analysis of individual variables can explain how alarming this is (or isn't).

Checking Individual Variables (Average Cost for Two) - The linear regression of Transformed Rating on Average Cost for Two returns a residual plot with a visible pattern

Checking Individual Variables (Average Cost for Two) - Referring back to the original scatter plot of cost against rating, recall that an exponential relationship was present. Taking the log of cost made the residual plot considerably more random.
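The benefit of logging a right-skewed cost variable can be illustrated with synthetic data (invented numbers, not the report's):

```r
# A relationship that is linear in log(cost) leaves structure behind when
# fitted against raw cost; fitting against log10(cost) captures it better.
set.seed(1)
cost   = exp(rnorm(500, mean = 2, sd = 0.8))    # right-skewed costs
rating = 1 + 0.6 * log10(cost) + rnorm(500, sd = 0.15)
raw_fit = lm(rating ~ cost)
log_fit = lm(rating ~ log10(cost))
c(raw = summary(raw_fit)$r.squared, log = summary(log_fit)$r.squared)
```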

City Regression - Investigating if the predictor variable City can be transformed to improve the model


At first glance, there seems to be some pattern in the residuals for the predictor City. However, this residual plot is not a concern, as the dips in the fitted line are mainly caused by a lack of data in certain parts of the graph. Otherwise, the residuals are fairly random.

Locality Regression - The residual plot for the regression of Transformed Rating ~ Locality is random and centered near zero for the most part

Principal Cuisines Regression - Similar to the regression for City, we experience dips in certain areas of the residual graph due to missing data

Booking Regression - The usual residual graph for the variable Has Table Booking is hard to interpret because it is a binary variable

Booking Regression (2) - Using boxplots of the residuals against the variable's values, we see the median is well below zero - the residuals are not randomly scattered
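That boxplot check can be sketched with base R graphics on synthetic residuals (illustrative values only):

```r
# Residuals grouped by a binary predictor: for an adequate model, both boxes
# should straddle the dashed zero line with medians near it.
set.seed(2)
booking = factor(sample(c("Yes", "No"), 200, replace = TRUE))
resids  = rnorm(200, sd = 0.3) - 0.2 * (booking == "Yes")
boxplot(resids ~ booking, xlab = "Has Table Booking", ylab = "Residuals")
abline(h = 0, lty = 2)   # reference line at zero
```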

Statistically Significant Predictors - List of localities that are significantly better or worse than baseline

# A tibble: 45 x 5
   term                              estimate std.error statistic  p.value
   <chr>                                <dbl>     <dbl>     <dbl>    <dbl>
 1 LocalitySpot 18 Mall, Pimple Sau~   -0.611    0.194      -3.15  1.63e-3
 2 LocalityLaw College Road            -0.452    0.194      -2.33  2.00e-2
 3 LocalityBaner                       -0.421    0.202      -2.08  3.78e-2
 4 LocalityKondapur                    -0.351    0.0802     -4.38  1.21e-5
 5 LocalityOhri' Hitech City           -0.296    0.149      -1.98  4.75e-2
 6 LocalityPark Street Area             0.171    0.0821      2.08  3.77e-2
 7 LocalityLower Parel                  0.181    0.0762      2.38  1.73e-2
 8 LocalityFort                         0.199    0.0963      2.06  3.91e-2
 9 LocalityHill Road, Bandra West       0.248    0.0890      2.79  5.28e-3
10 LocalityVersova, Andheri West        0.248    0.0993      2.50  1.24e-2
# ... with 35 more rows

Statistically Significant Predictors - List of Cities that are significantly better or worse than baseline

# A tibble: 2 x 5
  term       estimate std.error statistic p.value
  <chr>         <dbl>     <dbl>     <dbl>   <dbl>
1 CityJaipur    0.250    0.0793      3.15 0.00165
2 CityMysore   -0.166    0.0781     -2.13 0.0335 

Conclusion

The dataset we chose contained recorded information on 9551 different restaurants from around the world. A sample size this large is bound to contain useful information pertaining to restaurant performance.

We began with data visualization in order to let the data paint its own picture and reveal the areas of interest. At first glance we noticed that the data is heavily skewed, with more than 80% of the restaurants located in India. With such a heavy skew, analyzing the data as a whole would be a mistake, since the results for the other countries would be dominated by India.

As such, we isolate the restaurants in India from the rest of the world and investigate further. Our key indicator of restaurant performance is its rating measure - both Rating Text and Aggregate Rating - the former allowing for a general overview and the latter for more accurate regression analysis.

After identifying these trends among restaurants in India, we move on to analyze which variables actually affect restaurant performance. During our exploratory analysis we identified a skewed distribution in the number of votes per restaurant, and we take this into account by using votes as a weight parameter in our regression to obtain a better fit.

Since we are working with a set range of ratings, 0 to 5, we transform the response so that back-transformed estimates cannot fall outside this range: \[Transformed Rating = log_{10}{\frac{Aggregate Rating}{5-Aggregate Rating}}, \quad 0<Aggregate Rating<5\]

After running diagnostic plot tests with our regression models, we identify that the following variables are correlated with restaurant rating:

  1. log (Average Cost for Two (USD))
  2. Locality - Khandari is much worse off, while Aminabad has better restaurants
  3. City - Jaipur has better ratings and Mysore has worse ratings than New Delhi
  4. Booking Option - Restaurants with the option to book are slightly better off.

It is important to note that our selected regression model does not directly return an estimated aggregate rating. Let Y be the estimated value the regression model outputs; in order to undo the transformation on our response variable, we must perform the following: \[Estimated Aggregate Rating = \frac{5*10^{Y}}{1+10^{Y}}\]

Discussion

(HYPOTHESIS VS RESULTS) (SUMMARIZE RESULTS) (EXPLAIN VALIDITY OF RESULTS)

This study concerns a large dataset with nearly 10,000 entries. Normally a sample this size would be associated with many benefits, as it promotes accurate results; however, our greatest problem with this dataset is that it is heavily based in one specific country: India. This became a limitation when trying to compare restaurants fairly across the world, as only a limited amount of information was available for the countries besides India. The uneven coverage made it impossible to conclude with the accuracy originally intended.

The other problematic aspect of this dataset is that it did not provide other relevant measures that affect restaurant ratings and what it takes for consumers to enjoy their experience. The dataset provides a handful of variables, which we used to model and analyze their effects on the overall popularity of restaurants; beyond these, there are many others to consider, such as quality of service, layout, presentation, and theme. Because of this drawback, we may have missed variables with a significant impact on our findings. In a future study, these limitations could be avoided by collecting a more detailed selection of data. It would also be worth paying close attention to the variables in question and considering whether merging data from different sources, covering a greater range of subjects, would yield even more accurate findings.

---
output: 
  flexdashboard::flex_dashboard:
    theme: bootstrap
    source: embed
---

```{r setup, include=FALSE}
library(flexdashboard)
```
Introduction {data-icon="fa-book-open"}
=====================================
This report reviews the statistical modeling and analysis of data on restaurants around the world. The purpose of this study is to understand what makes a successful restaurant successful. The primary interest is to identify variables that are significantly associated with restaurant ratings, and to explain why, in terms of factors such as a restaurant's location and status. Additionally, this report is designed to visualize the trends behind earning and retaining high ratings in this industry.

The remainder of this report is organized as follows. The Tidying Data section gives a detailed walkthrough of how the dataset was prepared for modeling and for testing the significance each variable plays in restaurant ratings. The Exploratory Analysis section describes an initial look at plots that characterize some of the most significant variables. (FIRST AND SECOND ANALYSIS)

Tidying Data {data-icon="fa-table" .storyboard}
===================================== 

### **Loading Files From Kaggle Source** - Loading all libraries and files required for this analysis.

```{r echo=TRUE, message=FALSE, warning=FALSE}
#------LOAD DATA------------
library(tidyverse)
library(readxl)
library(leaflet)
library(flexdashboard)
library(gridExtra)
library(broom)
library(plotly)

#------LOAD FILES---------

CCodesfile="~/project/Country-Code.xlsx"
Zomatofile="~/project/zomato.xlsx"
CurrencyFile="~/project/CurrencyRates.xlsx"

CCodes=read_excel(CCodesfile) #importing country code file
zomato=read_excel(Zomatofile) #importing main zomato file
Currency=read_excel(CurrencyFile) #importing currency file

```

***
The original data set contained varying currencies per country, making it difficult to compare restaurants uniformly. We created a separate file containing the United States Dollar (USD) conversion rate for each country, which we use to convert local prices into a common standard.

### **Data Wrangling** - Although the data comes pre-cleaned into appropriate columns, some manipulation still needs to be done.

```{r echo=TRUE, warning=FALSE}
#------ GET RID OF SOME COLUMNS & ADD NEW ONES -------------
zomato=zomato %>% select (-c(Rating_Color,Switch_To_Order_Menu,Address,Locality_Verbose,Currency))
#we remove currency column as it is incorrect and contains too many unique symbols

#add in new columns: Country Names according to Country Code & proper currency names with conversion rates
zomato = zomato %>% left_join(CCodes) 
zomato = zomato %>% left_join(Currency) 

#add in new column: USD Cost (for equal comparison)
zomato = zomato %>% mutate(Avg_Cost_USD = Average_Cost_For_Two*`Conversion Rate (USD)`) 
#multiply the cost column by its USD conversion rate

#replacing Cuisines col. with Principal_Cuisines (the first category of cuisines for every country row)
zomato= zomato %>% separate(Cuisines,into=c("Principal_Cuisines"),sep=',') #store "primary" cuisine types into new column and replace old with this

#Make sure rating categories are ordered:
zomato = zomato %>% mutate(Rating_Text=ordered(Rating_Text,levels=c("Excellent","Very Good","Good","Average","Poor")))
```

***

We begin by removing unnecessary columns and adding in necessary ones:

1. Add in countries by country code via the country code file provided by Kaggle

2. Add in our own currency conversion file, allowing us to convert average cost for two into USD and store it in its own column called "Avg Cost USD".

We notice there were many cuisines listed per restaurant, so to simplify things for analytic purposes, we extract the first cuisine listed by each restaurant.

### **Removing Missing Values** - Since we will be focusing on the ratings of restaurants, we remove restaurants with missing aggregate rating values.

```{r echo=TRUE, warning=FALSE}
#---------REMOVE MISSING VALUES----------
#several restaurants, even expensive ones, show a rating of "0", so we treat unrated stores as missing
zomato[zomato == 0] = NA #replace zero values with NA
zomato[zomato == "Not rated"] = NA #replace "Not rated" values with NA
zomato = zomato %>% filter(!is.na(Aggregate_Rating))

india.data = zomato %>% filter(Country == "India")
other.data = zomato %>%  filter(Country!= "India")
```


Exploratory Analysis {data-icon="fa-chart-pie" .storyboard}
=====================================
### **Distribution Of Variables** 
```{r echo=F, fig.width=11}
library(RColorBrewer)
#display.brewer.all() #preview colour palettes (run interactively, not in the dashboard)

plot_themes = theme(plot.title=element_text(family="Palatino", face="bold.italic", size=15),legend.title = element_text(face = "italic", family = "Palatino"), 
                  legend.text = element_text(face = "bold.italic",family = "Palatino"), 
                  axis.title = element_text(family = "Palatino", face="italic",size = (10)),
                  axis.text.y.left = element_text(family = "Palatino", colour = "darkgrey", size = (10)),
                  axis.text.x.bottom = element_text(angle =90,family = "Palatino", colour = "darkgrey", size = (10)))

bplot_themes = theme(plot.title=element_text(family="Palatino", face="bold.italic", size=15),legend.title = element_text(face = "italic", family = "Palatino"),
                     #plot.margin = margin(2,-0.5,2,-0.5, "cm"),
                    legend.text = element_text(face = "bold.italic",family = "Palatino"), 
                    axis.title = element_text(family = "Palatino", face="italic",size = (10)),
                    axis.text.y.left = element_text(family = "Palatino", colour = "darkgrey", size = (10)),
                    axis.text.x.bottom = element_blank())#element_text(angle =90,family = "Palatino", colour = "black", size = (2)))

getPalette= colorRampPalette(brewer.pal(9,"RdBu"))

#----Ratings----

# Rating Text Dist
Rating_TextDist=ggplot(zomato,aes(x=Rating_Text,fill=Country))+geom_bar()+
  plot_themes+scale_fill_manual(values = getPalette(15))+
  labs(title = "Distribution of Rating Text",x="Rating Text", y="Restaurant Frequency")+
  scale_x_discrete(limits=c("Poor","Average","Good","Very Good","Excellent"))
Rating_TextDist

# Aggregate Rating Distribution
AgRatingDist=ggplot(zomato,aes(x=Aggregate_Rating,fill=Country))+geom_histogram()+
  plot_themes+scale_fill_manual(values = getPalette(15))+
  labs(title = "Distribution of Aggregate Rating",x="Aggregate Rating", y="Restaurant Frequency")
AgRatingDist

grid.arrange(Rating_TextDist, AgRatingDist, nrow =2)  
```


```{r echo=F, fig.width=11}
#----Prices------

# Price Range
Price_RangeDist=ggplot(zomato,aes(x=Price_Range,fill=Country))+geom_bar()+
  plot_themes+scale_fill_manual(values = getPalette(15))+
  labs(title = "Distribution of Price Range",x="Price Range", y="Restaurant Frequency")
Price_RangeDist

# Votes
VotesDist =ggplot(zomato,aes(x=Votes,fill=Country))+geom_histogram(bins = 100) +plot_themes+
  scale_fill_manual(values = getPalette(15))+
  labs(title="Distribution of Votes",x="Votes",y="Restaurant Frequency")
#skewed right
VotesDist

# Average Cost 
AvgCostDist=ggplot(zomato,aes(x=Avg_Cost_USD,fill=Country))+geom_histogram()+
  plot_themes+scale_fill_manual(values = getPalette(15))+
  labs(title = "Distribution of Average Cost USD",x="Average Cost USD", y="Restaurant Frequency")
AvgCostDist 

grid.arrange(Price_RangeDist, VotesDist,AvgCostDist, nrow =2)  
```


```{r echo=F, fig.width=11}
#----Booking/Delivery------

# Table Booking
Table_BookingDist=ggplot(zomato,aes(x=Has_Table_Booking,fill=Country))+geom_bar()+
  plot_themes+scale_fill_manual(values = getPalette(15))+
  labs(title = "Distribution of Table Booking Option",x="Table Booking Option", y="Restaurant Frequency")
Table_BookingDist

# Online Delivery
Online_DeliveryDist=ggplot(zomato,aes(x=Has_Online_Delivery,fill=Country))+geom_bar()+
  plot_themes+scale_fill_manual(values = getPalette(15))+
  labs(title = "Distribution of Online Delivery Option",x="Online Delivery Option", y="Restaurant Frequency")
Online_DeliveryDist

# Delivering Now
DeliveryDist=ggplot(zomato,aes(x=Is_Delivering_Now,fill=Country))+geom_bar()+
  plot_themes+scale_fill_manual(values = getPalette(15))+
  labs(title = "Distribution of Delivery Option",x="Delivery Option", y="Restaurant Frequency")
DeliveryDist

grid.arrange(Table_BookingDist, Online_DeliveryDist,DeliveryDist, ncol=3) 
#colourful, but grid.arrange has no shared-legend option
```


```{r echo=F, fig.width=11}
#----Location----
# Locality distribution
LocalityDist=ggplot(zomato,aes(x=Locality,fill=Country))+geom_bar()+
  bplot_themes+scale_fill_manual(values = getPalette(15))+
  labs(title = "Distribution of Locality",y="Restaurant Frequency")
LocalityDist #messy

# City Distribution
CityDist=ggplot(zomato,aes(x=City,fill=Country))+geom_bar()+
  bplot_themes+scale_fill_manual(values = getPalette(15))+
  labs(title = "Distribution of City",y="Restaurant Frequency")
CityDist #a little more clear

# Country Dist
CountryDist=ggplot(zomato,aes(x=Country,fill=Country))+geom_bar()+
  plot_themes+scale_fill_manual(values = getPalette(15))+
  labs(title = "Distribution of Country",y="Restaurant Frequency")
CountryDist #very clear that india is way more dominant

grid.arrange(LocalityDist, CityDist, CountryDist, nrow =3) #may want to display seperately
```

***

INSERT COMMENT HERE

### **How Variables React to Aggregate Rating**

```{r echo=F, fig.width=11}
# Agg Rating VS:

#------G21-------

# VS Average Cost USD
AgRating_AvgCosts=ggplot(zomato,aes(x=as.factor(Aggregate_Rating),y=Avg_Cost_USD))+geom_boxplot()+plot_themes+
  labs(title = "Aggregate Rating VS Average Cost",x="Aggregate Rating", y="Average Cost")
AgRating_AvgCosts 

# VS Price Range 
AgRating_PriceRange=ggplot(zomato,aes(x=as.factor(Price_Range),y=Aggregate_Rating))+geom_boxplot()+plot_themes+
  labs(title = "Aggregate Rating VS Price Range",x="Price Range",y="Aggregate Rating")
AgRating_PriceRange 

grid.arrange(AgRating_AvgCosts, AgRating_PriceRange, nrow =2)
```


```{r echo=F, fig.width=11}
#--------G22----------

# VS Votes (to test if popularity has an impact)
AgRating_Votes=ggplot(zomato,aes(x=as.factor(Aggregate_Rating),y=Votes))+geom_boxplot()+plot_themes+
  labs(title = "Aggregate Rating VS Votes",x="Aggregate Rating",y="Votes")
AgRating_Votes
#so there is a steady increase 

# VS Country 
AgRating_Country=ggplot(zomato,aes(x=Country,y=Aggregate_Rating))+geom_boxplot()+plot_themes+
  labs(title = "Aggregate Rating VS Country",x="Country",y="Aggregate Rating")
AgRating_Country

grid.arrange(AgRating_Votes, AgRating_Country, nrow =2) 
```



```{r echo=F, fig.width=11}
#-----G23----------

# VS Table Booking Option
AgRating_TableBooking=ggplot(zomato,aes(x=Has_Table_Booking,y=Aggregate_Rating))+geom_boxplot()+plot_themes+
  labs(title = "Aggregate Rating VS Table Booking",x="Table Booking Option",y="Aggregate Rating")
AgRating_TableBooking

# VS Online Delivery 
AgRating_OnlineDelivery=ggplot(zomato,aes(x=Has_Online_Delivery,y=Aggregate_Rating))+geom_boxplot()+
  plot_themes+labs(title = "Aggregate Rating VS Online Delivery",x="Online Delivery Option",y="Aggregate Rating")
AgRating_OnlineDelivery #interesting, not much of an effect, prob because of india

# VS Delivering Now
AgRating_DeliveringNow=ggplot(zomato,aes(x=Is_Delivering_Now,y=Aggregate_Rating))+geom_boxplot()+plot_themes+
  labs(title = "Aggregate Rating VS Delivering Now",x="Delivering Now",y="Aggregate Rating")
AgRating_DeliveringNow

grid.arrange(AgRating_TableBooking, AgRating_OnlineDelivery, AgRating_DeliveringNow, ncol =3) 
```

***
INSERT COMMENT HERE


### **India Compared to the World** - How does India compare to the other countries? Distributions

```{r echo=F, fig.width=11}
iplot_themes = theme(plot.title=element_text(family="Palatino", face="bold.italic", colour = "maroon", size=15),legend.title = element_text(face = "italic", family = "Palatino"), 
                    legend.text = element_text(face = "bold.italic",family = "Palatino"), 
                    axis.title = element_text(family = "Palatino", colour = "maroon",face="italic",size = (10)),
                    axis.text.y.left = element_text(family = "Palatino", colour = "darkgrey", size = (10)),
                    axis.text.x.bottom = element_text(angle =90,family = "Palatino", colour = "darkgrey", size = (10)))

#----->Distributions of India VS World-----

#Distribution of Ratings - how is india different from the world?

#-----Rating Text------
India_Rating_TextDist=ggplot(india.data,aes(x=Rating_Text))+geom_bar()+
  iplot_themes+
  labs(title = "Distribution of Rating Text: India",x="Rating Text", y="Restaurant Frequency")+
  scale_x_discrete(limits=c("Poor","Average","Good","Very Good","Excellent"))
India_Rating_TextDist

Other_Rating_TextDist=ggplot(other.data,aes(x=Rating_Text))+geom_bar()+
  plot_themes+
  labs(title = "Distribution of Rating Text: Other Countries",x="Rating Text", y="Restaurant Frequency")+
  scale_x_discrete(limits=c("Poor","Average","Good","Very Good","Excellent"))
Other_Rating_TextDist 

grid.arrange(India_Rating_TextDist, Other_Rating_TextDist, nrow =2) #very different
```


```{r echo=F, fig.width=11}
#------Aggregate Rating-------
India_AgRatingDist=ggplot(india.data,aes(x=Aggregate_Rating))+geom_bar()+
  iplot_themes+
  labs(title = "Distribution of Aggregate Rating: India",x="Aggregate Rating", y="Restaurant Frequency")
India_AgRatingDist

Other_AgRatingDist=ggplot(other.data,aes(x=Aggregate_Rating))+geom_bar()+
  plot_themes+
  labs(title = "Distribution of Aggregate Rating: Other Countries",x="Aggregate Rating", y="Restaurant Frequency")
Other_AgRatingDist 

grid.arrange(India_AgRatingDist, Other_AgRatingDist, nrow =2) #very different

#SHOW ALL 4 RATINGS TOGETHER
grid.arrange(Other_Rating_TextDist,India_Rating_TextDist,India_AgRatingDist, Other_AgRatingDist, nrow =4) #very different
```


```{r echo=F, fig.width=11}
# ------Cost-------
India_CostDist=ggplot(india.data,aes(x=Avg_Cost_USD))+geom_histogram()+
  iplot_themes+
  labs(title = "Distribution of Average Cost: India",x="Average Cost USD", y="Restaurant Frequency")
India_CostDist

Other_CostDist=ggplot(other.data,aes(x=Avg_Cost_USD))+geom_histogram()+
  plot_themes+
  labs(title = "Distribution of Average Cost: Other Countries",x="Average Cost USD", y="Restaurant Frequency")
Other_CostDist 

grid.arrange(India_CostDist, Other_CostDist, nrow =2) # costs of india is in a smaller range
```

***
INSERT COMMENT HERE


```{r echo=F, fig.width=11}
India_CityDist=ggplot(india.data,aes(x=City))+geom_bar()+
  iplot_themes+labs(title = "Distribution of Cities in India",x="Cities")
India_CityDist
```

### **India VS the World** - Let's see how the variables affecting Aggregate Rating in India differ from those in other countries

```{r echo=F, fig.width=11}

# Cost-------- 
India_AgRating_AvgCost=ggplot(india.data,aes(x=Avg_Cost_USD,y=Aggregate_Rating))+geom_point()+
  iplot_themes+labs(title = "Aggregated Rating VS Average Cost USD in India",x="Average Cost USD",y="Aggregate Rating")
India_AgRating_AvgCost #many good places to eat under $30 in India

Other_AgRating_AvgCost=ggplot(other.data,aes(x=Avg_Cost_USD,y=Aggregate_Rating))+geom_point()+
  plot_themes+labs(title = "Aggregated Rating VS Average Cost USD in Other Countries",x="Average Cost USD",y="Aggregate Rating")
Other_AgRating_AvgCost #have to spend more in other countries

grid.arrange(India_AgRating_AvgCost, Other_AgRating_AvgCost,ncol =2)
```


```{r echo=F, fig.width=11}
# VS Price Range------------
India_AgRating_PriceRange=ggplot(india.data,aes(x=as.factor(Price_Range),y=Aggregate_Rating))+geom_boxplot()+iplot_themes+
  labs(title = "Aggregate Rating VS Price Range: India",x="Price Range",y="Aggregate Rating")
India_AgRating_PriceRange

Other_AgRating_PriceRange=ggplot(other.data,aes(x=as.factor(Price_Range),y=Aggregate_Rating))+geom_boxplot()+plot_themes+
  labs(title = "Aggregate Rating VS Price Range: Other Countries",x="Price Range",y="Aggregate Rating")
Other_AgRating_PriceRange

grid.arrange(India_AgRating_PriceRange, Other_AgRating_PriceRange,ncol =2)
```



```{r echo=F, fig.width=11}
#------->Ag Rating of India VS World
# Agg Rating VS

# Votes (popularity effect)-------
India_AgRating_Votes=ggplot(india.data,aes(x=as.factor(Aggregate_Rating),y=Votes))+geom_boxplot()+iplot_themes+
  labs(title = "Aggregate Rating VS Votes: India",x="Aggregate Rating",y="Votes")
ggplotly(India_AgRating_Votes)

Other_AgRating_Votes=ggplot(other.data,aes(x=as.factor(Aggregate_Rating),y=Votes))+geom_boxplot()+plot_themes+
  labs(title = "Aggregate Rating VS Votes: Other Countries",x="Aggregate Rating",y="Votes")
ggplotly(Other_AgRating_Votes) #heavier increase

grid.arrange(India_AgRating_Votes, Other_AgRating_Votes,nrow =2) # more of an impact in other countries
```

***
In both groups, restaurants with higher aggregate ratings tend to attract more votes, but the increase is noticeably heavier in other countries - popularity appears to have more of an impact outside India.


### **Exploring India Data Only** - Where would the best place to eat in India be?

```{r echo=F, fig.width=11}
# Ag Rating VS

# Principal Cuisine-------
India_AgRating_PrincipalCuisines=ggplot(india.data,aes(x=Principal_Cuisines,y=Aggregate_Rating))+geom_boxplot()+
  iplot_themes+labs(title = "Aggregated Rating VS Cuisines in India",x="Cuisines",y="Aggregate Rating")
India_AgRating_PrincipalCuisines #highest rated genre of food is Modern foods

#City-----------
India_AgRating_City=ggplot(india.data,aes(x=City,y=Aggregate_Rating))+geom_boxplot()+
  iplot_themes+labs(title = "Aggregated Rating VS Cities in India",x="Cities",y="Aggregate Rating")
India_AgRating_City #bangalore is best 

#Cost---------
India_AgRating_AvgCost=ggplot(india.data,aes(x=Avg_Cost_USD,y=Aggregate_Rating))+geom_point()+
  iplot_themes+labs(title = "Aggregated Rating VS Average Cost USD in India",x="Average Cost USD",y="Aggregate Rating")
India_AgRating_AvgCost #many good places to eat under $30 in India

grid.arrange(India_AgRating_PrincipalCuisines, India_AgRating_City,India_AgRating_AvgCost,nrow =3) # not the best layout, panels are too squished
```


```{r echo=F, fig.width=11}

#Table Booking-----------
India_AgRating_TableBooking=ggplot(india.data,aes(x=Has_Table_Booking,y=Aggregate_Rating))+geom_boxplot()+iplot_themes+
  labs(title = "Aggregate Rating VS Table Booking in India",x="Table Booking Option",y="Aggregate Rating")
India_AgRating_TableBooking

#Online Delivery----------
India_AgRating_OnlineDelivery=ggplot(india.data,aes(x=Has_Online_Delivery,y=Aggregate_Rating))+geom_boxplot()+iplot_themes+
  labs(title = "Aggregate Rating VS Online Delivery in India",x="Online Delivery Option",y="Aggregate Rating")
India_AgRating_OnlineDelivery

#Is Delivering Now-------------
India_AgRating_DeliveringNow=ggplot(india.data,aes(x=Is_Delivering_Now,y=Aggregate_Rating))+geom_boxplot()+iplot_themes+
  labs(title = "Aggregate Rating VS Delivery in India",x="Delivery Option",y="Aggregate Rating")
India_AgRating_DeliveringNow

grid.arrange(India_AgRating_TableBooking, India_AgRating_OnlineDelivery,India_AgRating_DeliveringNow,ncol =3)
```

***
These boxplots compare Aggregate Rating across the three service options in India: table booking, online delivery, and currently delivering.


First Stage Analysis {data-icon="fa-chart-bar" .storyboard}
=====================================
### **Linear Regression** - Model with all the possible predictor variables in our dataset.

```{r echo=FALSE}
india.data = india.data %>% mutate(Principal_Cuisines=fct_relevel(Principal_Cuisines,"North Indian"))
india.data = india.data %>% mutate(City=fct_relevel(City,"New Delhi"))

library(sjPlot)
fit1 = lm(Aggregate_Rating~City+Locality+Principal_Cuisines+Has_Table_Booking+Has_Online_Delivery+Price_Range+Avg_Cost_USD,data=india.data,weights=Votes)
tab_model(fit1, terms = c("(Intercept)"))
fit1.sum = as.data.frame(coef(summary(fit1)))
DT::datatable(fit1.sum, options = list(
  bPaginate = FALSE
))

```

***

Since most of our data comes from New Delhi and the most common cuisine is North Indian, we set these as the baseline levels for City and Principal Cuisines. The first linear regression model includes the following variables:

- Aggregate Rating

- City

- Locality

- Principal Cuisine

- Table Booking Option

- Online Delivery Option

- Price Range

- Average Cost for Two (USD)

This model yields an R-squared value of 59.73% and is statistically significant. Included on the left is a table containing all the coefficient estimates for this model. Ordering the p-values from lowest to highest, notice that the majority of the variable estimates return p-values higher than 0.05.
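These summary figures can also be pulled out programmatically rather than read off the table. A minimal sketch, reusing `fit1` from the chunk above and the already-loaded `broom` and `dplyr` packages:

```r
# Extract the overall R-squared and list the terms with the smallest
# p-values from the fitted model.
library(broom)
library(dplyr)

glance(fit1)$r.squared                      # overall R-squared (~0.597)
tidy(fit1) %>% arrange(p.value) %>% head()  # lowest p-value terms first
```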

### **Diagnostic Graphs (1)** - Checking the model's residual fitted graph for accuracy of this model
```{r echo=F, fig.width=11}
  ggplot(data=fit1,aes(y=.resid,x=.fitted))+geom_point()+geom_smooth(se=F)+theme_bw()
```

***

Our residuals seem rather random, with the bulk of them clustered around zero. The fitted line illustrates the overall trend of the residuals, which in our case falls slightly below zero.

### **Diagnostic Graphs (2)** - Checking normality of residuals via QQ Plot
```{r echo = F, fig.width=11}
p2 <-ggplot(fit1,aes(sample=.resid))+stat_qq()+
  stat_qq_line()+xlab("Theoretical Quantiles")+ylab("Standardized Residuals")
p2<-p2+ggtitle("Normal Q-Q")+theme_bw()

p2
```

***
The QQ plot shows that the central region of the residuals is approximately normal, but the lower tail deviates heavily.

### **Diagnostic Graphs (3)** - Checking for homoscedasticity via Scale Location
```{r echo=F, fig.width=11}
plot(fit1,3)
```

***

We ideally want to observe a roughly straight horizontal line in this plot. Unfortunately, this scale-location plot suggests heavy heteroscedasticity.

### **Drop Test** - Checking if any variables can be dropped from the current model
```{r echo=F}
drop1(fit1,test="F")
```

***

Conducting a drop test using the F statistic shows we can drop the variable *Price Range* from our model without significantly affecting the fit.
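As a cross-check, the same conclusion can be reached by refitting the model without *Price Range* and comparing the two nested models directly. A sketch reusing `fit1` (the object name `fit1.nopr` is our own):

```r
# Refit without Price_Range and compare the nested models with an F test;
# a large p-value supports dropping the term.
fit1.nopr <- update(fit1, . ~ . - Price_Range)
anova(fit1.nopr, fit1)
```

This is equivalent to the corresponding row of `drop1(fit1, test = "F")`.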

Second Stage Analysis {data-icon="fa-chart-line" .storyboard}
=====================================

### **Transforming Response Variable** - We apply our first transformation to the dependent variable and observe its effects.

```{r echo=TRUE}
india.data = india.data %>% mutate(Transformed_Rating = log10(Aggregate_Rating/(5-Aggregate_Rating)))
```

```{r echo=F}
fit.t = lm(Transformed_Rating~City+Locality+Principal_Cuisines+Has_Table_Booking+Has_Online_Delivery+Avg_Cost_USD,data=india.data,weights=Votes)
tab_model(fit.t, terms = c("(Intercept)"))
fitt.sum = as.data.frame(coef(summary(fit.t)))
DT::datatable(fitt.sum, options = list(
  bPaginate = FALSE
))

```

***

We add a new variable to the dataset called "Transformed Rating". Since Aggregate Rating is bounded between 0 and 5, we apply a transformation that prevents fitted rating estimates from falling outside this range. The transformation is as follows: $$\log_{10}\frac{\text{Aggregate Rating}}{5-\text{Aggregate Rating}}$$

The transformed linear regression model returns an improved R-squared value of 66.2% (roughly a 6.5 percentage point increase over the original model).
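One practical note: predictions from this model live on the transformed scale. Assuming the transformation above, they can be mapped back to a 0-5 rating by inverting it (`inverse_transform` is a helper name of our own):

```r
# Invert t = log10(R / (5 - R)) to recover the rating R on the 0-5 scale:
# R = 5 * 10^t / (1 + 10^t).
inverse_transform <- function(t) 5 * 10^t / (1 + 10^t)

# Round trip: a rating of 4 transforms to log10(4/1) and back to 4.
inverse_transform(log10(4 / (5 - 4)))  # 4
```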

### **Drop Test** - Checking if any variables can be dropped from the transformed model
```{r echo=F}
drop1(fit.t,test="F")
```

***

Conducting a drop test using the F statistic shows we can now drop the variable *Has Online Delivery* from our transformed model without significantly affecting the fit.

### **Diagnostic Graphs (1)** - Fitting a residuals graph to check accuracy of the transformed model

```{r echo=F, fig.width=14}
fit.2t = lm(Transformed_Rating~City+Locality+Principal_Cuisines+Has_Table_Booking+Avg_Cost_USD,data=india.data,weights=Votes)

ggplot(data=fit.2t,aes(y=.resid,x=.fitted))+geom_point()+geom_smooth(se=F)+theme_bw()

```

### **Diagnostic Graphs (2)** - Checking normality of residuals via QQ Plot
```{r echo = F, fig.width=11}
p2t <-ggplot(fit.2t,aes(sample=.resid))+stat_qq()+
  stat_qq_line()+xlab("Theoretical Quantiles")+ylab("Standardized Residuals")
p2t<-p2t+ggtitle("Normal Q-Q")+theme_bw()

p2t
```

***

Our QQ plot illustrates that most of the data follows a relatively normal distribution of residuals. However, as with our original model, the upper and lower tails deviate from normality.

It's important to note that most of our variables are heavily skewed, producing these large influential outliers. Since we are working with a large sample size, the mild non-normality shown by this graph doesn't raise many red flags.


### **Diagnostic Graphs (3)** - Checking for homoscedasticity
```{r echo=F, fig.width=11}
p3t<-ggplot(fit.2t, aes(.fitted, sqrt(abs(.stdresid))))+geom_point(na.rm=TRUE)
p3t<-p3t+stat_smooth(method="loess", na.rm = TRUE)+xlab("Fitted Value")
p3t<-p3t+ylab("SquareRoot|Standardized residuals|")
p3t<-p3t+ggtitle("Scale-Location")+theme_bw()
p3t
```

***

The scale-location plot shows that there is still clear heteroscedasticity. Ideally, we want the points to follow a relatively straight horizontal line, indicating homoscedasticity. Further analysis of our variables can explain how alarming this is (or isn't).

### **Checking Individual Variables (Average Cost for Two)** - The linear regression of the Transformed Rating with Average Cost for Two returns a residual plot that follows some form of pattern

```{r echo = F, fig.width=14}
costreg = lm(Transformed_Rating~Avg_Cost_USD,data=india.data,weights = Votes)
#check residual fitted plot
ggplot(data=costreg,aes(y=.resid, x=.fitted))+geom_point()+geom_smooth(se=F)
```

### **Checking Individual Variables (Average Cost for Two)** - Referring back to the original scatter plot of cost against rating, recall that an exponential relationship was present. Taking the log of cost helped randomize the residual plot considerably.

```{r echo = F, fig.width=14}
costreg2 = lm(Transformed_Rating~log(Avg_Cost_USD),data=india.data,weights = Votes)
ggplot(data=costreg2,aes(y=.resid, x=.fitted))+geom_point()+geom_smooth(se=F)

```

### **City Regression** - Investigating if the predictor variable City can be transformed to improve the model

```{r echo = F, fig.width=11}
cityreg = lm(Transformed_Rating~City,data=india.data,weights = Votes)
#check residual fitted plot
ggplot(data=cityreg,aes(y=.resid, x=.fitted))+geom_point()+geom_smooth(se=F)

```

***

At first glance, there seems to be some pattern in the residuals for the predictor variable City. However, this residual plot is not concerning, as the dips illustrated by the fitted line are mainly caused by a lack of data in certain parts of the graph. Otherwise, the residuals are rather random.
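The sparsity claim is easy to verify from the data itself. A sketch using the existing `india.data`:

```r
# Count restaurants per city; cities with very few observations are the
# ones driving the dips in the fitted residual line.
india.data %>% count(City, sort = TRUE)
```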


### **Locality Regression** - The residual plot for the regression of Transformed Rating ~ Locality is random and centered around 0 for the most part

```{r echo = F, fig.width=14}
locreg = lm(Transformed_Rating~Locality,data=india.data,weights = Votes)
#check residual fitted plot
ggplot(data=locreg,aes(y=.resid, x=.fitted))+geom_point()+geom_smooth(se=F)
```

### **Principal Cuisines Regression** - Similar to the regression for City, we experience dips in certain areas of the residual graph due to missing data

```{r echo = F, fig.width=14}
cuisinereg = lm(Transformed_Rating~Principal_Cuisines,data=india.data,weights = Votes)
#check residual fitted plot
ggplot(data=cuisinereg,aes(y=.resid, x=.fitted))+geom_point()+geom_smooth(se=F)

```


### **Booking Regression** - The traditional residual graph for the variable: Has Table Booking is hard to interpret as it is a binary variable

```{r echo = F, fig.width=14}
bookreg = lm(Transformed_Rating~Has_Table_Booking,data=india.data,weights = Votes)
#check residual fitted plot
ggplot(data=bookreg,aes(y=.resid, x=.fitted))+geom_point()+geom_smooth(se=F)
```


### **Booking Regression (2)** - Using boxplots to graph the differences between the residuals and variable values, we see the median is well below zero, meaning the residuals are not randomly scattered

```{r echo = F, fig.width=13}
#boxplot - booking vs residual
bookreg.aug = augment(bookreg) 
a = ggplot(bookreg.aug,aes(x=Has_Table_Booking,y=.resid))+geom_boxplot()

b = ggplot(bookreg.aug,aes(sample=.resid))+stat_qq()+
  stat_qq_line()+facet_wrap(~Has_Table_Booking)
grid.arrange(a,b)
```

### **Statistically Significant Predictors** - List of *localities* that are significantly better or worse than baseline

```{r echo=FALSE}
tidy(fit.2t) %>% filter(p.value<0.05) %>% filter(str_detect(term,"Locality")) %>%  arrange(estimate)
```

### **Statistically Significant Predictors** - List of *Cities* that are significantly better or worse than baseline
```{r echo=F}
reg.city = tidy(fit.2t) %>% filter(p.value<0.05) %>% filter(str_detect(term,"City")) %>%  arrange(term)
head(reg.city, 2)
```


Conclusion {data-icon="fa-book"}
=====================================
The dataset we chose contained recorded information on 9551 different restaurants from around the world. A sample size this large is bound to contain useful information pertaining to restaurant performance.

We began with data visualization in order to let the data paint its own picture and determine which areas are of interest. At first glance we notice that our data is heavily skewed, with more than 80% of the restaurants located in India. With such a heavy skew, analyzing the data as a whole would be a mistake, since results for the other countries would be dominated by India.
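That share can be computed directly from the India/other split created during tidying (a sketch assuming `india.data` and `other.data` partition the full dataset):

```r
# Proportion of restaurants located in India.
nrow(india.data) / (nrow(india.data) + nrow(other.data))  # > 0.8
```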

As such, we isolate the restaurants in India from the rest of the world and further investigate. Our key indicator of restaurant performance is its rating measure - both Rating Text and Aggregate Rating - the former allowing for a general overview and the latter for more accurate regression analysis.

After identifying these trends among restaurants in India, we move on to analyze which variables actually affect restaurant performance. We identified a skewed distribution in the number of votes per restaurant during our exploratory analysis and take this into consideration by using votes as a weight parameter in our regression to obtain a better analysis.

Since we are working with a set range of ratings, 0 to 5, we restrict regression estimates from going above or below this range and use the following as our transformed response variable:
$$\log_{10}\frac{\text{Aggregate Rating}}{5-\text{Aggregate Rating}}$$